Introduction to R

Neil Lund

2024-08-12

What is R?

Background

  • Free and open source successor to S
  • “High Level” language
  • Made for statisticians who need to program (not programmers who need some statistics)

What is R-Studio?

Background

Installing R

Installing R

If you don’t already have it, go to https://cran.r-project.org/ and install the version of R that matches your operating system.

R without R-Studio

R in the windows command prompt

The R Graphical Interface

Installing R-Studio

Installing R-Studio

You can install R-Studio from here: https://posit.co/download/rstudio-desktop/

Using the R-Studio Interface

The R-Studio Layout

R-studio interface
  • Write commands in the source window (and save them when you’re done)

  • View output and results in the console

  • See what you’ve loaded or stored in the environment window

  • View and export plots from the output window (and manage packages from the packages tab)

Running code interactively

Type the following in the source window:

# making a histogram from random data 
x <- rnorm(100)
x  
hist(x)  

Click your cursor on the top line and press CTRL + ENTER (CMD + Enter on a Mac). Your cursor should move down and the code will be evaluated. Take note of what happens and where things show up after executing each line.

Line 1

# making a histogram from random data

Text after a # symbol is ignored by R, so this first line is just a comment. Use comments to remind yourself and others of what your code is doing.

Line 2

x <- rnorm(100)

The <- operator will assign a name to some data, and rnorm is a function that generates random data with a normal (bell-curve) distribution.

Here, I’ve created 100 random numbers and assigned them to a variable named x

Line 3

x
  [1] -0.340274882 -1.045915970 -2.108272191  2.214945624 -0.251412903
  [6] -0.433570992 -0.462448644  0.519613767 -0.047045115 -0.163051456
 [11] -0.090450166 -0.282777885 -1.830488975 -1.048406624  1.083749402
 [16]  0.291632964  1.519507513  0.816828836  0.457385135 -0.309932229
 [21]  1.593614428  0.766053945 -0.663485386  0.707028681 -1.111153531
 [26]  1.652771056  0.282936470 -0.198856856  0.396037848 -0.659970794
 [31]  1.032735921 -0.340453901 -0.479354283  0.010197954  2.070837917
 [36]  0.689180668 -0.736207506 -0.763886565  0.148579376 -0.774859096
 [41]  0.826564802 -2.492514142  1.316786413 -2.793666745  0.855208781
 [46]  0.885227390 -0.331190788 -0.460724469  1.973647106 -0.568130796
 [51] -0.569450113 -0.005972633 -1.142263691 -0.716878304 -1.003563021
 [56]  0.033032952 -1.605704675  0.938489170  0.607815990 -0.667784754
 [61]  1.211662979  0.368878384 -0.967103121 -0.337208858 -1.121476889
 [66]  0.083399165  0.225176852 -1.086001911  0.852149459  0.895411390
 [71] -2.711097133  0.091937272 -1.436931547 -1.771222620 -0.993134276
 [76] -0.011558710 -0.251582223 -1.679057225 -0.704872269  0.572474907
 [81]  1.253973958  0.291712426  0.515047486 -0.273886234 -2.615151425
 [86]  0.256559321  0.311541931  1.229496931  0.186504553  2.255627718
 [91]  1.316541822  0.081123478 -0.660498495  0.345884620  1.819290588
 [96] -0.206281936 -1.275180741  0.010272440 -0.295575996 -0.634963067

If I just reference the variable x, R will print its contents into the console.

Line 4

hist(x)

Finally, the hist() function takes some data and plots a histogram. I’m telling R “create a histogram from the contents of variable x”

Running an entire script

Now, with the same script open, press CTRL + SHIFT + S

# making a histogram from random data
x <- rnorm(100) 
x  
hist(x) 

What happened?

Restarting R-Studio

Do the following:

  • Close R-Studio

  • When asked if you would like to save the Workspace Image, click “No” (and ideally never click “Yes” ever from now until the end of time)

  • Re-open R-Studio

  • Try to run just the last line of the code again:

hist(x)

What happened?

Embracing impermanence

  • The data in R’s global environment goes away when you restart R.

  • The order of commands matters: you can’t evaluate X before you create it

  • Embrace it. Store scripts, not data (where feasible)
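
A minimal sketch of the second point, using a made-up variable name (my_numbers) that is not part of the script above: an object can only be used after the line that creates it has run.

```r
# 'my_numbers' is a hypothetical name used only for this illustration
my_numbers <- rnorm(100)   # create the object
exists("my_numbers")       # TRUE: it now lives in the environment

rm(my_numbers)             # remove it, roughly what a restart does to everything
exists("my_numbers")       # FALSE: R no longer knows this name

# referencing it now fails, so catch the error and keep its message
msg <- tryCatch(hist(my_numbers), error = function(e) conditionMessage(e))
msg
```

Re-running the line that creates my_numbers is all it takes to get it back, which is why saving the script matters more than saving the data.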

Dandelion (Taraxacum) Clock - Tennessee, USA - May 31, 2014

Installing Packages

R packages extend the basic functionality of R so you can do more stuff more easily.

To use a package, you first need to:

  1. Install it (just once!) using the install.packages("packagename") function or through the R-Studio interface

  2. Then you need to load it using the library(packagename) command each time you open R.
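
One common pattern, sketched here as an optional extra rather than something these slides require, is to check whether a package is already installed before installing it. The helper name is_installed is made up for this example; requireNamespace() is part of base R.

```r
# hypothetical helper: TRUE if a package is installed, FALSE otherwise
is_installed <- function(package_name) {
  isTRUE(requireNamespace(package_name, quietly = TRUE))
}

is_installed("stats")   # the stats package ships with R, so this should be TRUE

# install only when needed (commented out so nothing is downloaded by accident)
# if (!is_installed("Snake")) install.packages("Snake")
```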

Installing the Snake package

Install the Snake package by running (note the quotation marks):

install.packages("Snake")

(Or use the graphical interface)

Loading the Snake package

Load the Snake package by running (note the lack of quotation marks)

library(Snake)

Using the Snake Package

Now you can use commands from the Snake package. Try running this:

playSnake()

Admittedly this is not a typical use-case for this tool…

Importing Data

You can use the Import Dataset menu to import data into R with point-and-click commands OR by writing code.

We’ll walk through both approaches using this data set as an example:

https://raw.githubusercontent.com/fivethirtyeight/data/master/bob-ross/elements-by-episode.csv

This file contains data from the article “A Statistical Analysis of the Work of Bob Ross” by Walt Hickey.

Do people still know who Bob Ross is?

Importing data: point and click

the data import menu in R-Studio

The data import interface

Importing data: using a script

After loading the data with point-and-click commands, we should copy the code that produced it and save it in our script. That way, we can easily replicate or share our analysis by just re-running the script.

# load the readr package
library(readr)

# use the read_csv function to read the data
elements_by_episode <- read_csv(
  "https://raw.githubusercontent.com/fivethirtyeight/data/master/bob-ross/elements-by-episode.csv"
  )

# optionally, use View() to examine the data in the GUI
#View(elements_by_episode)

Notes on the three commands above, in order:

  1. Libraries (aka packages) are basically collections of functions and classes that extend what R can do. Some libraries are loaded by default when you load R, but most will have to be loaded using the library function. The readr package is what allows us to use the read_csv function, which is not a built-in command.
  2. Here the data is coming from a URL, but it could just as easily come from a file on your computer. Note that I’ve spread this over multiple lines. In general R doesn’t care about whether I write code on one line or several (however, it does care if I add a line break inside quotation marks!)
  3. The View() function is handy, but it can be kind of annoying because it makes a pop-up, so I placed a # in front of this command to turn it into a comment.
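
As a sketch of the local-file case mentioned above, the example below writes a tiny made-up table to a temporary file and reads it back. It uses base R's read.csv() so it runs without extra packages; the file contents and the object names (csv_path, toy_episodes) are invented for illustration. read_csv() from readr takes a file path in the same position.

```r
# a temporary file path; on your own machine this would be a path to a real file
csv_path <- tempfile(fileext = ".csv")

# write a small, made-up data set to that file
writeLines(c("episode,tree,cabin",
             "S01E01,1,0",
             "S01E02,1,1"),
           csv_path)

# read the file from disk instead of from a URL
toy_episodes <- read.csv(csv_path)
toy_episodes
```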

Viewing Data

Click the data set name in the environment window to view it. See if you can answer:

  • What is the unit of analysis?

  • How many observations are there (rows)? How many variables (columns)? How are things represented?

  • In the source window, start typing the name of the data set (elements_by_episode). What happens?
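
The same questions can also be asked from code. This sketch uses R's built-in cars data set as a stand-in so it runs without a download; swapping in elements_by_episode should work the same way once that object is loaded.

```r
nrow(cars)     # number of observations (rows)
ncol(cars)     # number of variables (columns)
names(cars)    # column names
head(cars, 3)  # the first three rows
str(cars)      # compact summary of how each column is stored
```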

R-Studio: Recap and tips

  1. RStudio is a graphical interface for the R programming language. You don’t need it to use R, but it simplifies a lot of analysis and can make your work faster with things like code completion.
  2. In general, you should use the “source” window to interact with R, and then save your commands in a script file when you’re done.
  3. Use hotkeys like CTRL + ENTER to run single lines of code, or CTRL + SHIFT + S to run an entire script.
  4. Many point-and-click commands in R-Studio will produce R code. You should save this in a script so you can easily replicate your results.

R data

Variables and the assignment operator

We’ve already used the <- operator to store a value and reference it by name. We call this process assignment.

x <- 1

Assignment is very flexible. You can change the value of an existing variable, you can make a copy with a new name, and you can even combine multiple variables into one or add them together.

# create a copy of x called y
y <- x 
# assign a new value to x 
x <- 2
#  add them together
x + y
# add them plus some other number and then assign the results to z
z <- x + y + 34

Variable naming

  • Variable names must start with either a letter or a period. They usually can’t contain spaces or certain special operators like + or -, but you can use these characters if you wrap the name in backticks (`).

  • A good practice is to give descriptive names to variables and use underscores (_) in place of spaces.

# this gives an error about an unexpected symbol: 
my variable <- "Some values"

# but this is okay: 
my_variable <- "Some values"

# in a pinch, you can also do this (but why would you?)
`my variable` <- "Some values"

Data types

In addition to storing data, R variables can have additional attributes and classes that impact how they’re stored, modified, or used in functions.

Mode

One of the most fundamental attributes is a variable’s mode, which is how R knows how to do things like distinguish numbers from text.

x <- 1
mode(x)

Data types (cont’d)

We’ve already seen numeric data. Three other very common types are:

character which is used for storing text and can be created by entering values inside quotation marks.

mode("abc")

logical which can take values of either TRUE or FALSE and can be created directly with those values OR by writing out a logical comparison like 3 > 4

mode(TRUE)

mode(3 > 5)

factor which is actually just a numeric vector of codes with some labels attached (which is why its mode is reported as "numeric")

factor(c(1,2,3), labels=c("A","B", "C"))

mode(factor(c(1,2,3), labels=c("A","B", "C")))
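
To see why a factor reports a numeric mode, it can help to look at its pieces separately. The variable name letter_factor is made up for this sketch.

```r
letter_factor <- factor(c(1, 2, 3), labels = c("A", "B", "C"))

class(letter_factor)       # "factor": how R treats the object
mode(letter_factor)        # "numeric": how the values are stored
levels(letter_factor)      # the labels: "A" "B" "C"
as.integer(letter_factor)  # the underlying codes: 1 2 3
```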

Why typing matters

What happens when you run this? And why?

"1" + "2"

Why is one of these problematic and the other isn’t?

x <- "ABC"

x <- ABC

What’s up with this?

"10" == 10 
[1] TRUE
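
One way to investigate these questions, offered as a sketch rather than a full answer: arithmetic needs numbers, so text has to be converted first, while == quietly converts the number to text before comparing. The name add_text is invented for this example.

```r
# "1" + "2" stops with an error, so catch it and keep the message
add_text <- tryCatch("1" + "2", error = function(e) conditionMessage(e))
add_text

# converting the text to numbers first makes the addition work
as.numeric("1") + as.numeric("2")   # 3

# == compares after converting the number 10 to the text "10"
"10" == 10      # TRUE
"10.0" == 10    # FALSE, because "10.0" and "10" are different strings
```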

Data structures

Data structures allow us to store and perform calculations on groups of numbers or text. The ones we’ll see most often are vector, matrix, data frame and list. We’ll talk briefly about each.

Vectors

Vectors store multiple elements of the same type. You can create a vector by passing a comma-separated list of values to the c() function.

x <- c(1, 2, 3)

y <- c(TRUE, TRUE, FALSE)

z <- c("A", "B", "C")

Vector coercion

The elements of a vector must share a type. If they don’t, then R will “coerce” each element to make them conform.

c("A", 3, 1, "B")
[1] "A" "3" "1" "B"

What happens when you try to create a vector with logical elements and numbers? Why?

c(TRUE, FALSE, 31, 1)
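
A quick way to check what coercion did is to ask for the class of the result. In this sketch, the names mixed_text and mixed_logical are invented for illustration.

```r
mixed_text <- c("A", 3, 1, "B")
class(mixed_text)        # "character": the numbers became text

mixed_logical <- c(TRUE, FALSE, 31, 1)
class(mixed_logical)     # "numeric": TRUE became 1 and FALSE became 0
mixed_logical

# the same conversion is why TRUE values can be counted with sum()
sum(c(TRUE, TRUE, FALSE))   # 2
```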

Vector indexing

We can use the [] operator to access specific elements of a vector. For instance, I can get the 2nd element of this vector by writing:

x <- c(1, 2, 3)

x[2]
[1] 2

Vector indexing (cont’d)

We can also use vectors to subset other vectors

x[c(1, 2)]
[1] 1 2

Vector indexing (cont’d)

And we can use a logical comparison to subset.

This comparison creates a logical vector:

# this produces a logical vector
x>1
[1] FALSE  TRUE  TRUE

And this uses a logical vector to subset another vector:

x[x>1]
[1] 2 3

Matrices

A matrix has a single type of data arranged in a fixed number of rows and columns. Here’s a matrix with 3 columns and 5 rows.

mydata <- 1:15
my_matrix <- matrix(data = mydata, nrow = 5, ncol = 3)
my_matrix
     [,1] [,2] [,3]
[1,]    1    6   11
[2,]    2    7   12
[3,]    3    8   13
[4,]    4    9   14
[5,]    5   10   15

Matrices as fancy vectors

“Under the hood” a matrix is really just a vector with some extra attributes, so I can subset it just like I would a vector:

my_matrix[1:4]
[1] 1 2 3 4

However, certain functions only make sense for matrices:

colSums(my_matrix)
[1] 15 40 65
rowSums(my_matrix)
[1] 18 21 24 27 30
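
The “extra attributes” are mostly just the dimensions. As a sketch, adding a dim attribute to a plain vector gives something that behaves like the matrix above; the name plain_vector is made up for this example.

```r
plain_vector <- 1:15
dim(plain_vector)                 # NULL: an ordinary vector has no dimensions

dim(plain_vector) <- c(5, 3)      # add dimensions: 5 rows, 3 columns
is.matrix(plain_vector)           # TRUE
colSums(plain_vector)             # 15 40 65, the same as colSums(my_matrix)
```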

Matrix indexing by row and column

It usually makes more sense to take an entire row or column from a matrix. To do that, I can use syntax like this:

my_matrix[2, ] # extract the second row of the matrix
[1]  2  7 12
my_matrix[, 1] # extract the 1st column
[1] 1 2 3 4 5
my_matrix[1:3, 3] # the first through third row of the third column
[1] 11 12 13

Data frames

Data frames have rows and columns like a matrix, but:

  1. Columns can contain different types of data
  2. All columns have names
  3. Columns are usually accessed using the data$colname notation. You can still use matrix-style indexing, though!

Data frames

Usually things will just “become” data frames when you import them, but you can also make one yourself with the data.frame() function.

mydf <- data.frame("text" = c("car", "dog", "house"),
                   "numbers" = c(1, 3, 5),
                   "booleans" = c(TRUE, TRUE, FALSE))

mydf
   text numbers booleans
1   car       1     TRUE
2   dog       3     TRUE
3 house       5    FALSE

You can use str() to get a sense of the structure of a complex R object.

str(mydf)
'data.frame':   3 obs. of  3 variables:
 $ text    : chr  "car" "dog" "house"
 $ numbers : num  1 3 5
 $ booleans: logi  TRUE TRUE FALSE

Data frame indexing

Use the $ operator to access entire columns (but not rows!)

mydf$numbers
[1] 1 3 5

Or you can use double brackets followed by a column name in quotation marks

mydf[["numbers"]]
[1] 1 3 5

(or use matrix notation)

mydf[,2]
[1] 1 3 5
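
Matrix notation also lets you pick rows, including with a logical comparison like the ones used for vectors earlier. A minimal sketch, recreating mydf so the block runs on its own:

```r
mydf <- data.frame(text = c("car", "dog", "house"),
                   numbers = c(1, 3, 5),
                   booleans = c(TRUE, TRUE, FALSE))

mydf[2, ]                    # the second row, all columns
mydf[mydf$numbers > 1, ]     # rows where numbers is greater than 1
mydf[mydf$booleans, "text"]  # the text column for rows where booleans is TRUE
```

The pattern is always data[rows, columns]; leaving one side blank means “keep all of them”.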

Data frames (notes)

  • You’re likely to encounter data frames more than any other type of data. However, many statistical operations will coerce your data to a vector or matrix before actually conducting the analysis.

  • A tibble is, for all intents and purposes, the same thing as a data frame, but it has fewer weird behaviors.
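
One commonly cited example of such behavior, sketched here with base R only: selecting a single column with matrix notation silently turns a data frame into a plain vector unless you pass drop = FALSE. The data frame scores is made up for this illustration.

```r
scores <- data.frame(name = c("a", "b", "c"), points = c(1, 3, 5))

one_column <- scores[, 2]
is.data.frame(one_column)     # FALSE: the result is a plain numeric vector

kept_shape <- scores[, 2, drop = FALSE]
is.data.frame(kept_shape)     # TRUE: still a data frame with one column
```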

Lists

Lists are like data frames without any of the restrictions. They can contain any number of types and can even contain other lists or data frames:

mylist <- list("letters" = c("A", "B", "C"),
               "scalar" = 10,
               "nested_list" = list("palette_1" = list("red", "blue", "green", "white"),
                                    "palette_2" = list("pink", "brown", "black")))

mylist
$letters
[1] "A" "B" "C"

$scalar
[1] 10

$nested_list
$nested_list$palette_1
$nested_list$palette_1[[1]]
[1] "red"

$nested_list$palette_1[[2]]
[1] "blue"

$nested_list$palette_1[[3]]
[1] "green"

$nested_list$palette_1[[4]]
[1] "white"


$nested_list$palette_2
$nested_list$palette_2[[1]]
[1] "pink"

$nested_list$palette_2[[2]]
[1] "brown"

$nested_list$palette_2[[3]]
[1] "black"

List indexing

You can also access parts of a list using the $ or [[]] operators.

mylist[["letters"]]
[1] "A" "B" "C"
mylist$letters
[1] "A" "B" "C"

And you can also access parts of a nested list:

mylist[['nested_list']][['palette_1']]
[[1]]
[1] "red"

[[2]]
[1] "blue"

[[3]]
[1] "green"

[[4]]
[1] "white"
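
A related detail worth knowing: single brackets on a list return a smaller list, while double brackets return the element itself. This sketch uses a cut-down, made-up list (small_list) so it runs on its own.

```r
small_list <- list(letters = c("A", "B", "C"),
                   nested_list = list(palette_1 = list("red", "blue")))

class(small_list["letters"])     # "list": still wrapped in a list
class(small_list[["letters"]])   # "character": the vector itself

# $ and [[ ]] can be chained to reach into nested lists
small_list$nested_list$palette_1[[2]]   # "blue"
```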

List notes

You generally won’t use lists for your analysis because you can’t really summarize their contents easily.

But you’ll still see them used for storing or transporting complex data (such as information used by web services) that can’t fit neatly into a data frame.

Or for R objects that return lots of information:

# a linear regression model using some R data
model<-lm(speed ~ dist, data=cars)
# it's a list!
mode(model)
[1] "list"
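
Because the model is a list, the same tools used for lists apply to it. As a sketch:

```r
model <- lm(speed ~ dist, data = cars)

names(model)         # the named pieces stored in the model object
model$coefficients   # one piece: the estimated intercept and slope
class(model)         # "lm": the class is what tells R to treat it as a model
```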

Subsetting exercise

Using the Bob Ross data frame from earlier:

  1. How would you find all episodes where Ross painted trees? How many are there?

  2. How would you find all of the elements he painted in episode 1?

  3. The “GUEST” column contains a 1 if an episode had a guest host and 0 otherwise. How would you remove all of these so you only had Bob Ross paintings?

R Functions

What is a function?

Functions are basically just blocks of generalized code that can be used over and over again. We’ve already used several:

  • c() concatenates values into a vector
  • matrix() turns a vector into a matrix
  • data.frame() turns a series of equal-length vectors into a data frame
  • mode() tells you how a particular piece of data is stored (numeric, character, logical, and so on)

We’ve also used operators like + and -. In actuality, these are also a kind of R function.
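
To see this for yourself, you can call an operator with ordinary function syntax by wrapping it in backticks. A short sketch:

```r
1 + 2        # the usual infix form
`+`(1, 2)    # the same call written like a regular function

# because + is a function, it can be handed to other functions
Reduce(`+`, c(1, 2, 3, 4))   # adds the numbers one after another: 10
```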

Infix functions: R is a calculator

Infix Functions

“Infix” functions are so obvious you probably don’t even think of them as functions. Try typing this into the script editor and then send it to R:

108 + 12

Infix Functions

Infix operators will always take a left hand side (LHS) argument and a right hand side argument (RHS)

LHS + RHS

Infix functions (and most R functions) can also take sets of numbers and will handle them pretty intuitively:

c(1, 3, 5) + 20
[1] 21 23 25
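
The same idea extends to two vectors of equal length, where the operation is applied element by element. A brief sketch (the object names are made up for this example):

```r
pairwise_sum <- c(1, 3, 5) + c(10, 20, 30)
pairwise_sum                  # 11 23 35

doubled <- c(1, 3, 5) * 2
doubled                       # 2 6 10

bigger_than_two <- c(1, 3, 5) > 2
bigger_than_two               # FALSE TRUE TRUE
```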

Infix Functions

Arithmetic

Operator        Usage                                          Example
+, -, /, *      plus, minus, divide, and multiply,             1 + 12 / 3
                respectively
%%, %/%         modulo division and integer division           13 %% 4, 13 %/% 4
^               raise the left hand side to the power of       3 ^ 4
                the right hand side

Logical/Comparisons

Operator        Usage                                          Example
==              test for equality                              3 == 4, "Horse" == "Donkey"
!=              test for inequality                            "Horse" != "Donkey"
>, <, >=, <=    greater than, less than, greater than or       45 >= 45, 45 > 45
                equal to, less than or equal to

Control/Specialized

Operator        Usage                                          Example
<-, =           assign RHS to LHS (<- is preferred over        x <- 13
                the equals sign because it’s less likely
                to be confused for a comparison)
:               make a sequence of numbers from LHS to RHS     1:10
[], [[]], $     subsetting (get an entire row or a column      x[1], cars[4, ], iris$Sepal.Length
                or single observation)
~               used in model formulas (like when we want      lm(speed ~ dist, data = cars)
                to estimate the effect of a variable on
                an outcome)
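
Running the arithmetic and comparison examples listed above gives the following results; the comments show what each line returns.

```r
1 + 12 / 3            # 5: division happens before addition
13 %% 4               # 1: the remainder after dividing 13 by 4
13 %/% 4              # 3: how many whole times 4 goes into 13
3 ^ 4                 # 81
45 >= 45              # TRUE
45 > 45               # FALSE
"Horse" != "Donkey"   # TRUE
1:10                  # the integers 1 through 10
```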

Prefix functions: R is a fancy calculator

Prefix syntax

Most R functions will use prefix notation. Use a prefix function by first writing the name of the function, followed by parentheses with some arguments inside.

For instance, here’s how I can get the sum of a set of numbers:

sum(1, 2, 3)
[1] 6

The help() function

The help function is a special function that brings up information about functions. Run this and see what shows up in the bottom right window of R-Studio.

help(sum)

You can also get help on an infix operator, but you need to wrap it in backticks (`) so R doesn’t get confused:

help(`+`)

The mean() function

Let’s take a closer look at the Default S3 method here for the mean() function

mean(x, ...)
## Default S3 method:
mean(x, trim = 0, na.rm = FALSE, ...)
  • x is the main input.

  • The two remaining arguments (trim and na.rm) have a name AND default values. This means we don’t have to write out these arguments every time.

Default function arguments

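Try running this (vector_a is a numeric vector that includes a missing value, NA):

mean(vector_a)
[1] NA

Now try this:

mean(vector_a, na.rm = TRUE)
[1] 3.4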

Note

R is pedantic about math! There really isn’t a valid way to calculate the mean with NA values, and NA is supposed to be a placeholder, so R doesn’t want to ignore it by default. So you’re forced to be explicit.
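
The definition of vector_a does not appear on these slides, so the values below are an assumption, chosen only because they reproduce the two results shown (NA, then 3.4).

```r
# hypothetical values: any numeric vector containing an NA behaves the same way
vector_a <- c(1, 2, 4, 5, 5, NA)

mean(vector_a)                # NA: one missing value makes the mean unknown
mean(vector_a, na.rm = TRUE)  # 3.4: the NA is dropped before averaging
```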

Named arguments

One more thing to note here is that we can name each argument explicitly, but we don’t have to:

vector_b <- c(1, 3, 4, 9)

mean(x = vector_b)
[1] 4.25

Now try without specifying the name x:

mean(vector_b)
[1] 4.25

Unnamed Arguments

If we don’t name the arguments, R will assume they’re being given in the order specified in the help file, which can cause problems if we get things out of sequence. That is why this is okay:

mean(vector_a, na.rm=T)
[1] 3.4

… but this gives an error message:

mean(vector_a, T)
Error in mean.default(vector_a, T): 'trim' must be numeric of length one

User created functions

You can create your own functions. To create a function, you’ll use the following general syntax:

myFunction <- function(arguments){
  # .... do some stuff  here 
  # ....
  result <- "some calculations"
  return(result)
}

Now execute your function just like you would any other R function.

myFunction(arguments = "some arguments")
[1] "some calculations"

Note: built-in functions like sum() will be available as soon as you load R, but user-created functions will just be loaded into the global environment, so they go away when you close R.
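
For a more concrete version of the template, here is a sketch of a small function with one required argument and one argument that has a default. The function name and its arguments are invented for this example.

```r
# hypothetical helper: convert temperatures from Fahrenheit to Celsius
fahrenheit_to_celsius <- function(temp_f, digits = 1) {
  temp_c <- (temp_f - 32) * 5 / 9   # the conversion formula
  result <- round(temp_c, digits)   # round using the digits argument
  return(result)
}

fahrenheit_to_celsius(212)               # 100: digits falls back to its default
fahrenheit_to_celsius(c(32, 50))         # 0 10: works on a whole vector at once
fahrenheit_to_celsius(100, digits = 0)   # 38: the default is overridden by name
```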

R functions: recap

  1. Infix functions like +, -, / will take a left hand side argument and a right hand side argument
  2. Prefix functions can be used by writing the function name followed by some arguments in parentheses. This type of function syntax is more common, especially for more complex operations.
  3. Use help() to get information about an R function.
  4. Pay attention to required arguments, argument names, and defaults.
  5. You can create your own functions. If you find yourself copy-pasting a chunk of code over and over, consider creating a function instead.

Analysis exercise

Find the top 10 most common elements Bob Ross painted and create a barplot showing the frequency of each.

Start by writing out the steps you need to perform as comments:

# Step 0. import the data (you should already have the code for this)

# Step 1. remove the columns with episode numbers or titles

# Step 2. get the sum of the numeric columns (there's a way to do this with just one function!)

# Step 3. sort the sums from highest to lowest

# Step 4. take the top 10 elements from this sorted result

# Step 5. create a bar plot from the object created in step 4 
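
To show how steps like these can translate into code, here is a sketch on a tiny made-up data frame (toy_paintings and its columns are invented, and it keeps the top 3 rather than the top 10). The same sequence of functions is one possible route for the real data, not the only one.

```r
# Step 0. a made-up stand-in for the imported data
toy_paintings <- data.frame(
  episode = c("E1", "E2", "E3", "E4"),
  tree    = c(1, 1, 1, 0),
  cloud   = c(1, 0, 1, 0),
  cabin   = c(0, 0, 1, 0),
  river   = c(1, 1, 1, 1)
)

# Step 1. drop the identifier column so only numeric columns remain
element_columns <- toy_paintings[, names(toy_paintings) != "episode"]

# Step 2. add up each column
element_counts <- colSums(element_columns)

# Step 3. sort from highest to lowest
sorted_counts <- sort(element_counts, decreasing = TRUE)

# Step 4. keep the top 3 (the exercise asks for 10)
top_counts <- head(sorted_counts, 3)

# Step 5. draw the bar plot
barplot(top_counts)
```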

Some tips

1. The computer loves you. It wants you to succeed

  • Error messages are informative! Even if they seem inscrutable.

  • When something doesn’t work, stop, take a breath, and then read the output.

  • Consult the help files

  • Search the internet! Few problems are unique and R has a lot of users.

2. It’s not like riding a bike

It’s pointless to try to memorize everything. Instead, try to follow good script-writing practices

  • Give objects meaningful names

  • Write lots of comments

  • Save your scripts (also with meaningful names)

  • Use the GUI to figure stuff out, but write the equivalent commands in your script

3. Start specific then generalize

  • Don’t try to tackle complex problems in one fell swoop. Take a couple of observations, figure out the correct answer, and then try to write code that gets you there. Then generalize that case to the entire data set.

  • Talk to yourself: use the comments to outline what you will do before you do it.

4. When all else fails

Ask for help from your classmates or instructors! You’ll get better help if you share enough code or data to allow someone else to replicate your problem easily.
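
One way to share a small piece of data, sketched here with the built-in cars data set: dput() prints R code that recreates an object, which someone else can paste straight into their own session. The name small_example is made up for this illustration.

```r
# take just a few rows so the example stays small
small_example <- head(cars, 3)

# print code that rebuilds small_example
dput(small_example)
```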